Physicochemistry

Depth Profiles: Dissolved Oxygen Depth Profiles

INSERT DESCRIPTION AND METHODS (NOELLE)


Physicochemical metrics by site

Response of coral to hypoxia:

Discuss with Maggie on what should be displayed here. Also Need shape files

16S rRNA Data Prep

1. Define sample groups: To prepare for downstream analysis, we first add group designations.

  1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (combo_pipeline.rdata) contains the ASV-by-sample table and the ASV taxonomy table.

  2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

  1. And finally we define a sample data frame that holds the different groups we extracted from the sample names. On the right are the sample names and the different group categories.

Sample abbreviations are as follows:

X indicates the Environment

  • W = Water
  • S = Sediment
  • C = Coral
  • M = Mat

YY indicates the site name

  • CC = Coral Caye
  • CR = Cayo Roldan

Z indicates the replicate number

So MCR5 is a Mat sample from Cayo Roldan, replicate 5.


Table of samples

SamName ENV SITE
CCC1 Coral Coral Caye
CCC2 Coral Coral Caye
CCC3 Coral Coral Caye
CCC4 Coral Coral Caye
CCC5 Coral Coral Caye
CCC6 Coral Coral Caye
CCR1 Coral Cayo Roldan
CCR2 Coral Cayo Roldan
CCR3 Coral Cayo Roldan
CCR4 Coral Cayo Roldan
CCR5 Coral Cayo Roldan
CCR6 Coral Cayo Roldan
MCR1 Mat Cayo Roldan
MCR2 Mat Cayo Roldan
MCR3 Mat Cayo Roldan
MCR5 Mat Cayo Roldan
SCC1 Sediment Coral Caye
SCC2 Sediment Coral Caye
SCC3 Sediment Coral Caye
SCR1 Sediment Cayo Roldan
SCR2 Sediment Cayo Roldan
SCR3 Sediment Cayo Roldan
WCC0 Water Coral Caye
WCC1 Water Coral Caye
WCC2 Water Coral Caye
WCC3 Water Coral Caye
WCR0 Water Cayo Roldan
WCR1 Water Cayo Roldan
WCR2 Water Cayo Roldan
WCR3 Water Cayo Roldan

2. Create & modify a phyloseq object: a phyloseq (ps) object serves as the basis for all downstream analyses.

Create base phyloseq object

A. Create a phyloseq (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow.

B. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

C. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.

D. Export sequence and taxonomy tables for the unadulterated phyloseq object to the DATA_TABLES directory for later use. We will use the prefix full to indicate that these are raw data products.

Remove contaminants & unwanted taxa

E. Remove any Mitochondria hits. Remember the original dataset contained 6160 ASVs.

To accomplish this we do the following:

  • Subset the taxa to generate a ps object of just Mitochondria,
  • Select the ASV column only, turn it into a factor, and use this to remove Mitochondria from the ps object.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6134 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6134 taxa by 8 taxonomic ranks ]

Looks like this removed 26 Mitochondria ASVs. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs.

F. Remove any Chloroplast hits.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5938 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5938 taxa by 8 taxonomic ranks ]

The code removed an additional 196 ASVs classified as Chloroplast.

G. Remove any Eukaryotic hits.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]

Finally, 5 Eukaryota ASVs were eliminated from the ps object.

We will again export the sequence and taxonomy tables to the DATA_TABLES directory this time with the file prefix trim to indicate that contaminants have been removed from these data.


The raw phyloseq object.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6160 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6160 taxa by 8 taxonomic ranks ]

The raw phyloseq object contains:

  • 380461 total reads
  • across 6160 ASVs
  • and 30 samples
The working phyloseq object.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]

The working phyloseq object with contaminants removed(ps_slv_work_filt) contains:

  • 361694 total reads
  • across 5933 ASVs
  • and 30 samples

The mean number of reads per sample is 12056. The minimum number of reads per sample is 154 while the maximum is 33829 reads.


Note the command to remove taxa (subset_taxa within phyloseq) removes anything that is NA for the taxonomic level of interest or above. That’s fine if you are working at the Phylum level—for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to run this command using Family != "Mitochondria" you will not only get rid all ASVs classified as Mitochondria but everything else that has NA for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the command as is will remove any ASVs not classified at Family level or above. And it is not well documented that the command is doing this—I had to go digging through the files to figure out what was happening. So lets see if we can get rid of just Mitochondria ASVs.

3. Sample summary: add total reads and total ASVs to sample summary table

Summary table of total reads & ASVs


At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis.


Two more things to do at this point are to create a merged phyloseq object grouped by environment AND create a phyloseq object for a subset of samples. These will come in handy later for some analyses. To make a merged ps object we collapse all samples from the same environment type together.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 4 samples ]
sample_data() Sample Data:       [ 4 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]
[1] "Coral"    "Mat"      "Sediment" "Water"   

Great, still 5933 ASVs but now only 4 “samples” corresponding to the unique environment types.

Water samples only

The second is to selcet only water samples and make a separate ps object. To do this you must select the samples you wish to keep. If you want to change the group of samples, modify the script accordingly. For now this function is off meaning you must modify the eval flag in the code chunk and list sample names for this to run.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]
  1. If you remove samples you probably lost some ASVs. So you need to get rid of any ASVs that have a total of 0 reads.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1111 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1111 taxa by 8 taxonomic ranks ]

So for water samples only there are 1111 ASVs and 8 samples.

4. Summary of Data Preparation: add total reads and total ASVs to sample summary table

ps_slv: Raw phyloseq object

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6160 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6160 taxa by 8 taxonomic ranks ]

This ps object contains 6160 ASVs, 380461 total reads, and 30 samples.

ps_slv_work_filt: Contaminants removed

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]

This ps object contains 5933 ASVs, 361694 total reads, and 30 samples after removing Mitochondria, Chloroplast, and Eukaryotes.

mergedGP: Grouped by environment

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5933 taxa and 4 samples ]
sample_data() Sample Data:       [ 4 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5933 taxa by 8 taxonomic ranks ]

This ps object contains 5933 ASVs, 361694 total reads, and 30 samples after collapsing by Environment (ENV).

ps_slv_water: Water samples only

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1111 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1111 taxa by 8 taxonomic ranks ]

This ps object contains 1111 ASVs, 187164 total reads, and 8 samples after collapsing by Environment (ENV).


At this point in the workflow there are the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created.

Diversity

Overview In this section we will tackle diversity by looking taxonomic content as well as alpha and beta diversity metrics.

We created two different color palettes—one for taxa and one for environment—using the same 9 colors.


My soapbox about color

Much of the subsequent taxonomic analyses will be at the Class & Family levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy.

Color blindness & graphics

Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong’s scheme described in this paper. Wong’s color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency—roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not?

This scheme is conservative—there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed 12 and 15 color palettes, but be careful—figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display.

Here are the hex codes for the colors:
#009E73, #D55E00, #F0E442, #CC79A7, #56B4E9, #E69F00, #0072B2, #7F7F7F, #000000

Total reads & ASVs by Taxa: What are the dominant taxa in this system? Here we take a Class-level look at overall diversity.

Total reads & ASVs by Class


Most reads by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Oxyphotobacteria

Most ASVs by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Planctomycetacia

Relative read abundance: In this section we calculate relative percent read abundance for the combined dataset. This allows us to collapse rare taxa into an Other category.

Class-level relative abundance


Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples—a box-and-whisker plot as well as separate bar plots.

I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the Class level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code.

Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an Other category. We can define ‘Other’ however we like so lets take a look at the overall relative abundance of each Class.

Inspecting the table it looks like if we choose a cutoff of 0.01748 (1.748%) we get 8 taxa—sounds pretty good. The rest go into the ‘Other’ category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the Other category and draw the graph.

Class abundance across sample type: Here we generate a bar chart to look at taxonomic abundance for each of the main environment types.


At a 1.5% abundance cutoff, 91 Classes are grouped into the ‘Other’ category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart:

It took some tweaking to get the bar chart to look just right—so there is a lot of code here—and it could most certainly be better :/. While we’re at it we will also save a copy of the figure so we can tweak it later and make it look pretty.

One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the Other category. We will see what these taxa are in the next panel.

Class abundance across sample type: Here we generate a modified bar chart separated by individual samples.


Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of Other taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including

Phylum mean Sed samples mean All samples
Proteobacteria 0.3956 0.3382
Bacteroidetes 0.1646 0.2780
Planctomycetes 0.1228 0.0374
Acidobacteria 0.0884 0.0187
Latescibacteria 0.0260 0.0094
Chloroflexi 0.0220 0.0045
Gemmatimonadetes 0.0196 0.0044
Spirochaetes 0.0186 0.0162
Kiritimatiellaeota 0.0153 0.0041
Thaumarchaeota 0.0151 0.0032
Calditrichaeota 0.0138 0.0028
Nanoarchaeaeota 0.0115 0.0026
Nitrospirae 0.0105 0.0021
Cyanobacteria 0.0078 0.1061
Euryarchaeota 0.0060 0.0094
Firmicutes 0.0056 0.0918
Crenarchaeota 0.0052 0.0014
Modulibacteria 0.0049 0.0010
Zixibacteria 0.0040 0.0008
BRC1 0.0032 0.0006
Fibrobacteres 0.0030 0.0006
NA 0.0028 0.0015
Hydrogenedentes 0.0027 0.0005
Lentisphaerae 0.0025 0.0008
Actinobacteria 0.0025 0.0075
Omnitrophicaeota 0.0023 0.0005
Fusobacteria 0.0019 0.0064
Marinimicrobia_(SAR406_clade) 0.0017 0.0049
Verrucomicrobia 0.0017 0.0026
Dadabacteria 0.0015 0.0008
Elusimicrobia 0.0014 0.0003
Nitrospinae 0.0012 0.0003
PAUC34f 0.0012 0.0002
Asgardaeota 0.0011 0.0002
Schekmanbacteria 0.0010 0.0002
Entotheonellaeota 0.0009 0.0002
Tenericutes 0.0007 0.0262
AncK6 0.0007 0.0001
Hydrothermarchaeota 0.0005 0.0001
LCP-89 0.0005 0.0001
Dependentiae 0.0004 0.0001
Aegiribacteria 0.0003 0.0001
Cloacimonetes 0.0002 0.0044
Patescibacteria 0.0002 0.0003
TA06 0.0002 0.0000
WS2 0.0001 0.0000
Deinococcus-Thermus 0.0001 0.0001
Deferribacteres 0.0001 0.0007
Epsilonbacteraeota 0.0001 0.0039
Margulisbacteria 0.0001 0.0000
CK-2C2-2 0.0001 0.0000
Altiarchaeota 0.0000 0.0000
WS1 0.0000 0.0000
FCPU426 0.0000 0.0000
Atribacteria NA 0.0000
Synergistetes NA 0.0000
Thermotogae NA 0.0006
WPS-2 NA 0.0001

\(\alpha\)-diversity estimates Here we use several diversity indices and add the results to the summary table.


Alpha diversity describes the diversity in a sample or site. There are several alpha diversity metrics available in phyloseq: Observed, Chao1, ACE, Shannon, Simpson, InvSimpson, Fisher. Play around to see how different metrics change or confirm these results.

Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this workshop tutorial by Kim Dill-McFarland and Madison Cox.

First we run the diversity estimates, add these data to our summary table, and save a copy of this table.

\(\alpha\)-diversity statistics: Test whether the results are normally distributed. Results of this will help guide the analyses we do next.

Shapiro-Wilk Normality Test for Shannon index.


    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Shannon
W = 0.93, p-value = 0.05

Shapiro-Wilk Normality Test for inverse Simpson index.


    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$InvSimpson
W = 0.71, p-value = 0.000002

Shapiro-Wilk Normality Test for Chao1 richness estimator.


    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Chao1
W = 0.89, p-value = 0.004

Shapiro-Wilk Normality Test for Observed ASV richness estimator.


    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Observed
W = 0.89, p-value = 0.004

Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, the Chao richness estimate, and Observed richness but this approach can be used for any metric.

Ok, since the p-values are significant for the inverse Simpson, Chao richness, and Observed ASV richness we reject the null hypothesis that these data are normally distributed.

However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index.

Assessing significance of \(\alpha\)-diversity metrics: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.

Normally distributed metrics

            Df Sum Sq Mean Sq F value  Pr(>F)    
ENV          3   48.8   16.25      46 1.6e-10 ***
Residuals   26    9.2    0.35                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Shannon ~ ENV, data = sampledataDF)

$ENV
                  diff      lwr     upr  p adj
Mat-Coral       2.4703  1.52853  3.4120 0.0000
Sediment-Coral  3.1813  2.36572  3.9969 0.0000
Water-Coral     0.7671  0.02254  1.5116 0.0417
Sediment-Mat    0.7110 -0.34190  1.7639 0.2727
Water-Mat      -1.7032 -2.70211 -0.7043 0.0004
Water-Sediment -2.4142 -3.29518 -1.5333 0.0000

Non-normally distributed metrics

Kruskal-Wallis of inverse Simpson index.


    Kruskal-Wallis rank sum test

data:  InvSimpson by ENV
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 0.0002

Pairwise significance test for inverse Simpson index.


    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$InvSimpson and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0006 0.011 -       
Water    0.792  0.006 0.002   

P value adjustment method: fdr 

Kruskal-Wallis of Chao1 richness estimator.


    Kruskal-Wallis rank sum test

data:  Chao1 by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001

Pairwise significance test for Chao1 richness estimator.


    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$Chao1 and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0003 0.067 -       
Water    0.0001 0.034 0.001   

P value adjustment method: fdr 

Kruskal-Wallis of Observed ASV richness index.


    Kruskal-Wallis rank sum test

data:  Observed by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001

Pairwise significance test for Observed ASV richness index.


    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$Observed and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0003 0.067 -       
Water    0.0001 0.034 0.001   

P value adjustment method: fdr 

Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test).

Ok, the results of the ANOVA are significant. Here we use the Tukey’s HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different.

Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting.

Now we can look at the results on the inverse Simpson diversity and Chao’s richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance.

For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis.

\(\alpha\)-diversity plots: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.

Alpha diversity of microbial communities


Here we plot the results of each diversity index and include the Observed ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments.

\(\beta\)-diversity analysis: NMDS coupled with Jensen-Shannon divergence.


Call:
metaMDS(comm = veganifyOTU(physeq), distance = distance) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(veganifyOTU(physeq))) 
Distance: bray 

Dimensions: 2 
Stress:     0.1584 
Stress type 1, weak ties
Two convergent solutions found after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))' 

Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination option using the method flag and distance metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen-Shannon divergence. We also save a copy of the figure for later tweaking.

We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot.

\(\beta\)-diversity NMDS plot :


So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment?

\(\beta\)-diversity statistics :

First we use the adonis function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not.


Call:
adonis(formula = fish.jsd ~ ENV, data = sampledf, permutations = 1000) 

Permutation: free
Number of permutations: 1000

Terms added sequentially (first to last)

          Df SumsOfSqs MeanSqs F.Model    R2 Pr(>F)    
ENV        3      2.81   0.937    7.65 0.469  0.001 ***
Residuals 26      3.19   0.123         0.531           
Total     29      6.00                 1.000           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results indicate that centroids are significantly different across environments meaning that communities are different by environments.

We can also use the pairwiseAdonis package for pair-wise PERMANOVA analysis.

              pairs F.Model     R2 p.value p.adjusted sig
1      Coral vs Mat   4.363 0.2376   0.001      0.006   *
2 Coral vs Sediment   4.837 0.2321   0.001      0.006   *
3    Coral vs Water   8.308 0.3158   0.001      0.006   *
4   Mat vs Sediment   9.299 0.5376   0.008      0.048   .
5      Mat vs Water  19.764 0.6640   0.002      0.012   .
6 Sediment vs Water  15.542 0.5643   0.001      0.006   *

Here we see again we see that communities are different by environments.

However, PERMANOVA assumes equal beta dispersion so we will use the betadisper function from the vegan package to calculate beta dispersion values.


    Homogeneity of multivariate dispersions

Call: betadisper(d = fish.jsd, group = sampledf$ENV, bias.adjust =
TRUE)

No. of Positive Eigenvalues: 28
No. of Negative Eigenvalues: 1

Average distance to median:
   Coral      Mat Sediment    Water 
   0.445    0.196    0.323    0.194 

Eigenvalues for PCoA axes:
(Showing 8 of 29 eigenvalues)
PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8 
1.310 0.831 0.804 0.447 0.400 0.264 0.242 0.232 

And then a pair-wise Permutation test for homogeneity of multivariate dispersions using permutest (again from the vegan package).


Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000

Response: Distances
          Df Sum Sq Mean Sq    F N.Perm Pr(>F)    
Groups     3  0.377   0.126 12.6   1000  0.001 ***
Residuals 26  0.259   0.010                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
            Coral      Mat Sediment Water
Coral             9.99e-04 9.99e-04  0.00
Mat      5.74e-10          2.00e-03  0.98
Sediment 1.62e-06 2.07e-04           0.12
Water    1.94e-04 9.80e-01 1.21e-01      

These results are significant, meaning that environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the significant differences (p-value < 0.05) are between most environments.

This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions.

We can also use Analysis of Similarity (ANOSIM)—which does not assume equal group variances—to test whether overall microbial communities differ by environment.


Call:
anosim(x = distance(ps_slv_work_filt, "jsd"), grouping = spgroup) 
Dissimilarity: 

ANOSIM statistic R: 0.701 
      Significance: 0.001 

Permutation: free
Number of permutations: 999

Upper quantiles of permutations (null model):
   90%    95%  97.5%    99% 
0.0924 0.1265 0.1504 0.1752 

Dissimilarity ranks between and within classes:
         0%    25%   50%    75%  100%   N
Between  52 173.75 260.5 345.25 409.5 320
Coral    28  72.25 106.0 223.75 409.5  66
Mat      13  14.25  15.5  16.75  19.0   6
Sediment 21  35.50  40.0  50.50  56.0  15
Water     1   7.75  21.0  31.50  48.0  28

And the AN0SIM result is significant meaning that environment influences microbial community composition.


To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below.

Water

Site differences: Here we look at differences between the two sites. For


---
title: "Bocas Hypoxia"
output: 
  flexdashboard::flex_dashboard:
    storyboard: TRUE
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(phyloseq); packageVersion("phyloseq")
library(ggplot2); packageVersion("ggplot2")
library(plotly)
library(DT)
library(kableExtra)
library(tidyverse)
library(data.table)
library(dplyr)
library("plyr"); packageVersion("plyr")
library(vegan)
library(pairwiseAdonis)
library(knitr)

```

```{r set_wd, include=FALSE}
remove(list = ls())
knitr::opts_knit$set(root.dir = getwd())
# This will setwd to wherever the .Rmd file is opened.
ptm <- proc.time()
start_time <- Sys.time()
opts_chunk$set(cache=TRUE)
#formatR::tidy_app() run this in R to tidy code. How to do it here?
```

Physicochemistry  {.storyboard}
=========================================

### **Depth Profiles**:  Dissolved Oxygen Depth Profiles  {data-commentary-width=600}

INSERT DESCRIPTION AND METHODS (NOELLE)

```{r}
ds <- read.csv(file="DATA_TABLES/INPUT/depth_profiles.csv",
                  stringsAsFactors = FALSE)
```


```{r depth_profile}
ds$site_o= factor(ds$site, levels=c('Crawl_Key','Cayo_Roldon','Hospital_Point','Finca',
                                    'Mainland', 'Pastores', 'Punta_Caracol', 'Almirante'))
p1   <- ggplot(ds, aes(DO,depth_m)) +
  geom_point(shape = 16, size = 1.5, color = "darkblue") +
  facet_wrap(~ site_o, ncol = 2) +
  labs(#title = "Dissolved Oxygen Depth Profiles",
      # subtitle = "Data plotted by site, Sept. 21, 2017",
       y = "depth (m)",
       x = "DO mg/L") +
  geom_vline(xintercept = 2, linetype="dashed",color = "red", size=.5) +
  scale_y_continuous(trans = "reverse",  limits = c(11,0))
gp <- ggplotly(p1)
gp %>% layout(margin = list(b = 75))
#plotly::ggplotly(gp)%>%layout(autosize = T)
#geom_smooth(method = loess, color = "darkblue")+
#theme_bw(base_size = 10) +
# set scale to 11 m, add smoother line, and add hypoxia marker
```

***

Physicochemical metrics by site


```{r}
datatable(ds, 
          style="bootstrap", 
          class="compact", 
          rownames = FALSE, 
          width = "100%", 
          caption = htmltools::tags$caption(style = 
                        "caption-side: top; text-align: left;", 
                        "Table X: ", htmltools::em("Site Physicochemistry.")), 
                        extensions = "Buttons", 
                        options = list(columnDefs = 
                                        list(list(className = "dt-left", targets = 0)), 
                                        dom = "Blfrtip", buttons = c("csv", "copy"), 
                                        scrollX = TRUE, scrollCollapse = FALSE,
                                        deferRender=TRUE,  scrollY=300, scroller=TRUE,
                                        lengthMenu = c(10, 100, 250)))
```


### **Response of coral to hypoxia**:

Discuss with Maggie on what should be displayed here. Also Need shape files

```{r hypoxia_coral_response, eval = FALSE, include = FALSE}
rm(list=ls())
library(lme4)
library(lmerTest)
library(car)
library(rcompanion)
library(MASS)
library(emmeans)
library(nlme)

###Do analysis on each response variable separately
###For each response- site as fixed factor and colony as random factor 

data <- read.table("Data_Tables/INPUT/hypoxiadata.csv", header=T, sep=",")

##Make variables in dataframe accessible by name within a session
attach(data)
#head(data)

##Check the data class, categories etc.
#class(site)
#levels(site)
#class(colony)

#################################
######### PAM ###################
#################################

### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data (i.e., not transformed) and residuals
## Data are normal, variances hetergeneous (transform doesn't improve)


########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(yield)


########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect

lmepam<-lmer(yield~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res1 <- resid(lmepam)

#Produce a histogram of the residuals.
hist(res1,main="Yield",xlab="Residuals")
plotNormalHistogram(res1)

####Shapiro test on residuals from lme model p = 0.4312
pam.shapiro <- shapiro.test(res1) #runs a normality test on residuals
print(pam.shapiro) # null = normally distrubuted P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res1,
       ylab="Sample Quantiles for PAM")
qqline(res1, 
       col="red")

##### Levene Test
leveneTest(yield~ as.factor(site),data=data) # p = 0.00785

bartlett.test(x=yield, g=site) # p = 0.01178

#### Data normal but not homogenous- log transform does not improve
###########################################################


############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
####  Compared full model with fragment dropped
####  Best fit model (with lowest AIC) was the model with colony dropped
####  BUT using model that includes fragment because it's necessary random effect

##When comparing models with different random effects like here- use REML=FALSE. But when not comparing use REML=TRUE (default)
#full model- includes site, colony, and fragment nested in colony

#(colony/frag includes a term for colony alone and nests fragment in colony)
pam1<-lmer(yield~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude)  #AIC -121.1218
ranova(pam1)

#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
pam2<-lmer(yield~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude)  #AIC -123.1218
ranova(pam2)

mod.AIC<-AIC(pam1, pam2) 
mod.AIC  ### use the full model to account for nesting fragments

##Final models uses REML- not comparing models with different random effects
finalpam <- lmer(yield~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)

# p values for random effects
finalrandpam <- ranova(finalpam)

pam.table<-anova(finalpam, type=2) #ANOVA table
pam.table
finalrandpam


#################################
######### ZOOX ###################
#################################

### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals 
## CR6B is an outlier and removed- log-transform data makes data normal and homogenous

########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
attach(data)
plotNormalHistogram(zoox)

########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## yield is normally distributed, but variances not homogenous

lmezoox<-lmer(zoox~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res2 <- resid(lmezoox)

#Produce a histogram of the residuals.
hist(res2,main="Zoox",xlab="Residuals")
plotNormalHistogram(res2)

####Shapiro test on residuals from lme model p = 0.004795
zoox.shapiro <- shapiro.test(res2) #runs a normality test on residuals
print(zoox.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res2,
       ylab="Sample Quantiles for Zoox")
qqline(res2, 
       col="red")

##### Levene Test
leveneTest(zoox~ as.factor(site),data=data) # p = 0.0007486

##############################################################################################
################################### Data Transformations #####################################
##############################################################################################

###########################################################
########## Log transformation ##############################
############# Log transform improves normality- data are homegenous

### Add a new column to dataframe for log transformation 
data$zoox_log = log(zoox)

### Check that log transformed data is added as a new column
head(data)
attach(data)

### Check normality of log-transformed data
plotNormalHistogram(zoox_log)

#Check normality of log-transformed residuals
lmezoox_log<-lmer(zoox_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res3 <- resid(lmezoox_log)

#Produce a histogram of the residuals.
hist(res3,main="zoox log",xlab="Residuals")
plotNormalHistogram(res3)

####Shapiro test on residuals from lme model = non-normal p = 0.3461
zooxlog.shapiro <- shapiro.test(res3) #runs a normality test on residuals
print(zooxlog.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res3,
       ylab="Sample Quantiles for Zooxlog")
qqline(res3, 
       col="red")

##### Levene Test
leveneTest(zoox_log~ as.factor(site),data=data) # p = 0.7968


############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
####  Use log-transformed data

#(colony/frag includes a term for colony alone and nests fragment in colony)
zoox1<-lmer(zoox_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude)  #AIC 72.8756
ranova(zoox1)

#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
zoox2<-lmer(zoox_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude)  #AIC 70.8756
ranova(zoox2)

mod.AIC<-AIC(zoox1, zoox2) 
mod.AIC  ### use the full model to account for nesting fragments

##Final models uses REML- not comparing models with different random effects
finalzoox <- lmer(zoox_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)

# p values for random effects
finalrandzoox <- ranova(finalzoox)
zoox.table<-anova(finalzoox, type=2) #ANOVA table
zoox.table
finalrandzoox


#################################
######### Chla  ###################
#################################

### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals 

########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(chla)

########## DISTRIBUTION OF RESIDUALS #############################

### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## raw data are normal but variances not homogenous- log-transformed improves homogeneity

lmechla<-lmer(chla~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res4 <- resid(lmechla)

#Produce a histogram of the residuals.
hist(res4,main="Chla",xlab="Residuals")
plotNormalHistogram(res4)

####Shapiro test on residuals from lme model p = 0.3015
chla.shapiro <- shapiro.test(res4) #runs a normality test on residuals
print(chla.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res4,
       ylab="Sample Quantiles for Zoox")
qqline(res4, 
       col="red")

##### Levene Test
leveneTest(chla~ as.factor(site),data=data) # p = 0.01606

#### Data are normal but variances are not homogenous

###########################################################
########## Log transformation ##############################

### Add a new column to dataframe for log transformation 
data$chla_log = log(chla)

### Check that log transformed data is added as a new column
head(data)
attach(data)

### Check normality of log-transformed data
plotNormalHistogram(chla_log)

#Check normality of log-transformed residuals
lmechla_log<-lmer(chla_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res7 <- resid(lmechla_log)

#Produce a histogram of the residuals.
hist(res7,main="chla log",xlab="Residuals")
plotNormalHistogram(res7)

####Shapiro test on residuals from lme model = non-normal p = 0.1627
chlalog.shapiro <- shapiro.test(res7) #runs a normality test on residuals
print(chlalog.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res7,
       ylab="Sample Quantiles for chlalog")
qqline(res7, 
       col="red")

##### Levene Test
leveneTest(chla_log~ as.factor(site),data=data) # p = 0.7487

############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
### on log transformed data

#(colony/frag includes a term for colony alone and nests fragment in colony)
chla_log1<-lmer(chla_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude)  #AIC 55.11567
ranova(chla_log1)

#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
chla_log2<-lmer(chla_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude)  #AIC 53.11567
ranova(chla_log2)

mod.AIC<-AIC(chla_log1, chla_log2) 
mod.AIC  ### use the full model to account for nesting fragments

##Final models uses REML- not comparing models with different random effects
finalchla_log <- lmer(chla_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)

# p values for random effects
finalchla_logrand <- ranova(finalchla_log)
chla_log.table<-anova(finalchla_log, type=2) #ANOVA table
chla_log.table
finalchla_logrand

#################################
######### chlc  ###################
#################################

### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals 
## Data need to be log transformed for normality and variances

########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(chlc)

########## DISTRIBUTION OF RESIDUALS #############################

### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## yield is normally distributed, but variances not homogenous

lmechlc<-lmer(chlc~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res5 <- resid(lmechlc)

#Produce a histogram of the residuals.
hist(res5,main="chlc",xlab="Residuals")
plotNormalHistogram(res5)

####Shapiro test on residuals from lme model p = 0.004475
chlc.shapiro <- shapiro.test(res5) #runs a normality test on residuals
print(chlc.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res5,
       ylab="Sample Quantiles for Zoox")
qqline(res5, 
       col="red")

##### Levene Test
leveneTest(chlc~ as.factor(site),data=data) # p = 0.01316

#### Data not normally distributed or homogenous

###########################################################
########## Log transformation ##############################
############# Log transform improves normality and homogeneity

### Add a new column to dataframe for log transformation 
data$chlc_log = log(chlc)

### Check that log transformed data is added as a new column
head(data)
attach(data)

### Check normality of log-transformed data
plotNormalHistogram(chlc_log)

#Check normality of log-transformed residuals
lmechlc_log<-lmer(chlc_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res6 <- resid(lmechlc_log)

#Produce a histogram of the residuals.
hist(res6,main="chlc log",xlab="Residuals")
plotNormalHistogram(res6)

####Shapiro test on residuals from lme model = non-normal p = 0.8874
chlclog.shapiro <- shapiro.test(res6) #runs a normality test on residuals
print(chlclog.shapiro) # null = normally distrubuted (P<0.05 = non-normal

##QQ plots of residuals from lme model
qqnorm(res6,
       ylab="Sample Quantiles for chlclog")
qqline(res6, 
       col="red")

##### Levene Test
leveneTest(chlc_log~ as.factor(site),data=data) # p = 0.884

############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
####### On log-transformed data

#(colony/frag includes a term for colony alone and nests fragment in colony)
chlc_log1<-lmer(chlc_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude)  #AIC 64.52399
ranova(chlc_log1)

#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
chlc_log2<-lmer(chlc_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude)  #AIC 62.52399
ranova(chlc_log2)

mod.AIC<-AIC(chlc_log1, chlc_log2) 
mod.AIC  ### use the full model to account for nesting fragments

##Final models uses REML- not comparing models with different random effects
finalchlc_log <- lmer(chlc_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)

# p values for random effects
finalchlc_logrand <- ranova(finalchlc_log)

chlc_log.table<-anova(finalchlc_log, type=2) #ANOVA table
chlc_log.table
finalchlc_logrand
```

16S rRNA Data Prep {.storyboard}
=========================================

### 1. **Define sample groups**: To prepare for downstream analysis, we first add  group designations.  {data-commentary-width=500}

1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (`combo_pipeline.rdata`) contains the ASV-by-sample table and the ASV taxonomy table. 

2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

```{r deliniate_sample_types}
load("../01_DADA2_WORKFLOW/combo_pipeline.rdata")
samples.out <- rownames(seqtab)
subject <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
# this splits the string at first instance of a digit
sample_name <- substr(samples.out, 1, 999)  # use the whole string for individuals
environ <- substr(samples.out, 0, 1)  # use the first two letters for genus
site <- substr(samples.out, 2, 3)  # use the next three letters for species
num_samp <- length(unique(sample_name))
num_environ <- length(unique(environ))
num_sites <- length(unique(site))
```

3. And finally we define a sample data frame that holds the different groups we extracted from the sample names. On the right are the  sample names and the different group categories.  

> Sample abbreviations are as follows: 

**X** indicates the Environment

* W = Water
* S = Sediment
* C = Coral
* M = Mat

**YY** indicates the site name

* CC = Coral Caye
* CR = Cayo Roldan

**Z** indicates the replicate number

So  `MCR5` is a Mat sample from Cayo Roldan, replicate 5. 

*** 

Table of samples


```{r define_variables}
#define a sample data frame
samdf <- data.frame(SamName = sample_name, ENV = environ, SITE = site) 
samdf$SITE <- gsub('CC', 'Coral Caye', samdf$SITE)
samdf$SITE <- gsub('CR', 'Cayo Roldan', samdf$SITE)
samdf$ENV <- gsub('W', 'Water', samdf$ENV)
samdf$ENV <- gsub('S', 'Sediment', samdf$ENV)
samdf$ENV <- gsub('C', 'Coral', samdf$ENV)
samdf$ENV <- gsub('M', 'Mat', samdf$ENV)

rownames(samdf) <- samples.out
kable(samdf, row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)  %>%
  column_spec(1:3, width = "3.5cm")
```

### 2. **Create & modify a phyloseq object**: a phyloseq (`ps`) object serves as the basis for all downstream analyses.{data-commentary-width=650}

Create base phyloseq object

**A**. Create a `phyloseq` (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow. 

**B**. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can  get cumbersome downstream, so we rename the ASVs using a simpler convention---ASV1, ASV2, ASV3, and so on. 

```{r create_ps_object}
ps_slv <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf), 
    tax_table(tax_silva)) # this create the phyloseq object
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
taxa_names(ps_slv) <- paste0("ASV", seq(ntaxa(ps_slv))) # adding unique ASV names
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
head(taxa_names(ps_slv))
```

**C**. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.

```{r add_ASV_coulmn}
colnames(tax_table(ps_slv)) <- c("Kingdom", "Phylum", "Class", "Order", 
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
```

**D**. Export sequence and taxonomy tables for the unadulterated phyloseq object to the `DATA_TABLES` directory for later use. We will use the prefix `full` to indicate that these are raw data products. 

```{r export_seq_tax_tables}
write.table(tax_table(ps_slv), "DATA_TABLES/OUTPUT/full_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv)), "DATA_TABLES/OUTPUT/full_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```

Remove contaminants & unwanted taxa

**E**. Remove any Mitochondria hits. Remember the original dataset contained `r ntaxa(ps_slv)` ASVs.

To accomplish this we do the following:

* Subset the taxa to generate a `ps` object of just Mitochondria, 
* Select the ASV column only, turn it into a factor, and use this to remove Mitochondria from the `ps` object. 

```{r remove_specific_taxa}
# generate a file with mitochondria ASVs
MT1 <- subset_taxa(ps_slv, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps_slv), MT1df)
ps_slv_work_no_mito <- prune_taxa(goodTaxa, ps_slv)
ps_slv_work_no_mito
```

Looks like this removed **`r ntaxa(ps_slv) - ntaxa(ps_slv_work_no_mito)` Mitochondria ASVs**. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs. 

**F**. Remove any Chloroplast hits.

```{r remove_specific_taxa2}
# generate a file with mitochondria ASVs
CH1 <- subset_taxa(ps_slv_work_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_mito), CH1df)
ps_slv_work_no_chloro <- prune_taxa(goodTaxa, ps_slv_work_no_mito)
ps_slv_work_no_chloro
```

The code removed an additional **`r ntaxa(ps_slv_work_no_mito) - ntaxa(ps_slv_work_no_chloro)` ASVs** classified as Chloroplast. 

**G**. Remove any Eukaryotic hits.

```{r remove_specific_taxa3}
# generate a file with mitochondria ASVs
EU1 <- subset_taxa(ps_slv_work_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_chloro), EU1df)
ps_slv_work_filt <- prune_taxa(goodTaxa, ps_slv_work_no_chloro)
ps_slv_work_filt
```

Finally, **`r ntaxa(ps_slv_work_no_chloro) - ntaxa(ps_slv_work_filt)` Eukaryota ASVs** were eliminated from the `ps` object.


We will again export the sequence and taxonomy tables to the `DATA_TABLES` directory this time with the file prefix `trim` to indicate that contaminants have been removed from these data.

```{r export_seq_tax_tables_2}
write.table(tax_table(ps_slv_work_filt), "DATA_TABLES/OUTPUT/trim_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv_work_filt)), "DATA_TABLES/OUTPUT/trim_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```

***

The raw phyloseq object.

```{r, results = 'markdown' }
ps_slv
```

The raw phyloseq object contains: 

* `r sum(otu_table(ps_slv))` total reads  
* across `r ntaxa(ps_slv)` ASVs 
* and `r nsamples(ps_slv)` samples 


The working phyloseq object.
```{r, results = 'markdown' }
ps_slv_work_filt
```

The working phyloseq object with contaminants removed(`ps_slv_work_filt`) contains:

* `r sum(otu_table(ps_slv_work_filt))` total reads  
* across `r ntaxa(ps_slv_work_filt)` ASVs 
* and `r nsamples(ps_slv_work_filt)` samples 

```{r}
mean_reads <- as.integer(mean(sample_sums(ps_slv_work_filt)))
min_reads <- as.integer(min(sample_sums(ps_slv_work_filt)))
max_reads <- as.integer(max(sample_sums(ps_slv_work_filt)))
```

The mean number of reads per sample is **`r mean_reads`**. The minimum number of reads per sample is **`r  min_reads`** while the  maximum is **`r max_reads`** reads.

***

**Note** the command to remove taxa  (`subset_taxa` within phyloseq)  removes anything that is `NA` for the taxonomic level of interest or above. That's fine if you are working at the Phylum level---for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to  run this command using `Family != "Mitochondria"` you will not only get rid all ASVs classified as Mitochondria but everything  else that has `NA` for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the  command *as is* will remove any  ASVs not classified at Family level or above. And it is not well documented that the command is doing this---I had to go digging through the files to figure out what was happening. So lets see if we can get rid of **just** Mitochondria ASVs.

### 3. **Sample summary**: add total reads and total ASVs to sample summary table{data-commentary-width=500}

Summary table of total reads & ASVs

```{r sample_summary_table}
total_reads <- sample_sums(ps_slv_work_filt)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_slv_work_filt, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")

sam_details <- samdf
rownames(sam_details) <- NULL

colnames(sam_details) <- c("Sample_ID", "ENV", "Site") 

merge_tab <- merge(sam_details, total_reads, by = "Sample_ID")
merge_tab2 <- merge(merge_tab, total_asvs, by = "Sample_ID")
colnames(merge_tab2) <- c("Sample_ID", "Environment", "Site", 
    "total_reads", "total_ASVs")
```

```{r}
datatable(merge_tab2, 
          style="bootstrap",
          class="compact", 
          rownames = FALSE, 
          width = "100%", 
          caption = htmltools::tags$caption(style = 
                        "caption-side: top; text-align: left;", 
                        "Table X: ", htmltools::em("Total reads & ASVs by sample.")), 
                        extensions = "Buttons", 
                        options = list(columnDefs = 
                                        list(list(className = "dt-left", targets = 0)), 
                                        dom = "Blfrtip", buttons = c("csv", "copy"), 
                                        scrollX = TRUE, scrollCollapse = TRUE,                                                              deferRender=TRUE,  scrollY=300, scroller=TRUE,
                                        lengthMenu = c(10, 30)))
```

***

At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis. 

***


Two more things to do at this point are  to create a merged phyloseq object grouped by environment AND create a phyloseq object for a subset of samples. These will come in handy later for some analyses. To make a merged `ps` object we collapse all samples from the same environment type together.

```{r merge, warning=FALSE}
mergedGP <- merge_samples(ps_slv_work_filt, "ENV")
SD <- merge_samples(sample_data(ps_slv_work_filt), "ENV")
mergedGP
sample_names(mergedGP)
```

Great, still `r ntaxa(mergedGP)` ASVs but now only `r nsamples(mergedGP)` "samples" corresponding to the unique environment types. 


**Water samples only**

The second is to selcet only water samples and make a separate `ps` object. To do this you must select the samples you *wish to keep*. If you want to change the group of samples, modify the script accordingly. For now this function  is *off* meaning you must modify the `eval` flag in the code chunk and list sample names for this to run.

```{r select_water_samples}
ps_slv_sub <- prune_samples(c("WCC0", "WCC1", "WCC2", "WCC3", "WCR0", "WCR1", "WCR2", "WCR3"), ps_slv_work_filt)
ps_slv_sub
```

6. If you remove samples you probably lost some ASVs. So you need to get rid of any ASVs that have a total of **0 reads**. 

```{r remove_ASV_with_zeros_reads}
ps_slv_water <- prune_taxa(taxa_sums(ps_slv_sub) > 0, ps_slv_sub)
ps_slv_water
write.table(tax_table(ps_slv_water), "DATA_TABLES/OUTPUT/water_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv_water)), "DATA_TABLES/OUTPUT/water_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```

So for water samples only there are `r ntaxa(ps_slv_water)` ASVs and `r nsamples(ps_slv_water)` samples.


### 4. **Summary of Data Preparation**: add total reads and total ASVs to sample summary table{data-commentary-width=500}

> `ps_slv`: Raw phyloseq object

```{r}
ps_slv
```

This `ps` object contains `r ntaxa(ps_slv)` ASVs, `r sum(otu_table(ps_slv))` total reads, and `r nsamples(ps_slv)` samples. 

> `ps_slv_work_filt`: Contaminants removed

```{r}
ps_slv_work_filt
```

This `ps` object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after removing Mitochondria, Chloroplast, and Eukaryotes. 

> `mergedGP`: Grouped by environment

```{r}
mergedGP
```

This `ps` object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after collapsing by Environment (ENV).


> `ps_slv_water`: Water samples only

```{r}
ps_slv_water
```

This `ps` object contains `r ntaxa(ps_slv_water)` ASVs, `r sum(otu_table(ps_slv_water))` total reads, and `r nsamples(ps_slv_water)` samples after collapsing by Environment (ENV).

***

At this point in the workflow there are  the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created. 


Diversity {.storyboard}
=========================================

### **Overview** In this section we will tackle diversity by looking taxonomic content as well as alpha and beta diversity metrics.  {data-commentary-width=600}

We created two different color palettes---one for taxa and one for environment---using the same 9 colors. 

```{r define_color_blind_scheme}
friend_pal <- c("#009E73", "#D55E00", "#F0E442", "#CC79A7", "#56B4E9", 
    "#E69F00", "#0072B2", "#7F7F7F", "#000000")

samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9")

plot(x=1:9, y=rep(5,9), pch=19, cex=7, col=friend_pal, 
     col.axis = 'white', col.lab = 'white')

```

***

My soapbox about color

Much of the subsequent taxonomic analyses  will be at the **Class** & **Family** levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems  are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy. 

Color blindness & graphics

Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong's scheme described in this [paper](http://dx.doi.org/10.1038/nmeth.1618){target="_blank"}. Wong's color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency---roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not? 

This scheme is conservative---there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed [12 and 15 color palettes](http://mkweb.bcgsc.ca/colorblind/){target="_blank"}, but be careful---figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display. 

Here are the hex codes for the colors: 
`r friend_pal` ### **Total reads & ASVs by Taxa**: What are the dominant taxa in this system? Here we take a Class-level look at overall diversity. {data-commentary-width=500} Total reads & ASVs by Class ```{r diversity_table} # generate the ASV table tax_asv <- table(tax_table(ps_slv_work_filt)[, "Class"], exclude = NULL, dnn = "Taxa") tax_asv <- as.data.frame(tax_asv, make.names = TRUE) # generate the reads table tax_reads <- factor(tax_table(ps_slv_work_filt)[, "Class"]) tax_reads <- apply(otu_table(ps_slv_work_filt), MARGIN = 1, function(x) { tapply(x, INDEX = tax_reads, FUN = sum, na.rm = TRUE, simplify = TRUE) }) tax_reads <- as.data.frame(tax_reads, make.names = TRUE) tax_reads <- cbind(tax_reads, reads = rowSums(tax_reads)) tax_reads <- tax_reads[31] tax_reads <- setDT(tax_reads, keep.rownames = TRUE)[] # merge the two tables and make everything look pretty # in an interactive table taxa_read_asv_tab <- merge(tax_reads, tax_asv, by.x = "rn", by.y = "Taxa") top_reads <- top_n(taxa_read_asv_tab, n = 6, wt = reads) top_asvs <- top_n(taxa_read_asv_tab, n = 6, wt = Freq) names(taxa_read_asv_tab) <- c("Taxa", "total reads", "total ASVs") write.table(taxa_read_asv_tab, "DATA_TABLES/OUTPUT/Table_X_reads_ASVs_by_sample.txt", sep = "\t", row.names = FALSE, quote = FALSE) datatable(taxa_read_asv_tab, rownames = FALSE, width = "100%", colnames = c("Taxa", "total reads", "total ASVs"), caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", "Table X: ", htmltools::em("Total reads & ASVs by Class")), extensions = "Buttons", options = list(columnDefs = list(list(className = "dt-left", targets = 0)), dom = "Blfrtip", buttons = c("csv", "copy"), scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE, lengthMenu = c(10, 50, 100))) top_reads2 <- top_reads[,-1] rownames(top_reads2) <- top_reads[,1] top_asvs2 <- top_asvs[,-1] rownames(top_asvs2) <- top_asvs[,1] ``` *** >**Most reads by taxa**: `r row.names(top_reads2)` >**Most ASVs by taxa**: `r row.names(top_asvs2)` ### **Relative read abundance**: In this section we calculate relative percent read abundance for the combined dataset. This allows us to collapse rare taxa into an **Other** category. {data-commentary-width=500} Class-level relative abundance ```{r calc_rel_abund_and_merge} # calculate the averages and merge by species ps_slv_filt_AVG <- transform_sample_counts(ps_slv_work_filt, function(x) x/sum(x)) mergedGP_BAR <- merge_samples(ps_slv_filt_AVG, "ENV") SD_BAR <- merge_samples(sample_data(ps_slv_filt_AVG), "ENV") # merge taxa by rank. If you choose a different rank be sure to change # the rank throughout this code chunk mdata_phy <- tax_glom(mergedGP_BAR, taxrank = "Class", NArm = FALSE) mdata_phyrel <- transform_sample_counts(mdata_phy, function(x) x/sum(x)) meltd <- psmelt(mdata_phyrel) meltd$Class <- as.character(meltd$Class) # calculate the total relative abundance for all taxa means <- ddply(meltd, ~Class, function(x) c(mean = mean(x$Abundance))) means$mean <- round(means$mean, digits = 5) taxa_means <- means[order(-means$mean), ] # this order in decending fashion taxa_means <- format(taxa_means, scientific = FALSE) # ditch the sci notation top_perc <- top_n(taxa_means, n = 8, wt = mean) top_perc$mean <- round(as.numeric(top_perc$mean), digits = 5) min_top_perc <- round(as.numeric(min(top_perc$mean)), digits = 5) ``` ```{r rel_abund_table} datatable(taxa_means, rownames = FALSE, width = "100%", colnames = c("Class", "mean"), caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", "Table X: ", htmltools::em("Class-level relative abundance.")), extensions = "Buttons", options = list(columnDefs = list(list(className = "dt-left", targets = "_all")), dom = "Blfrtip", buttons = c("csv", "copy"), scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE, lengthMenu = c(10, 50, 100))) write.table(taxa_means, "DATA_TABLES/OUTPUT/Table_X_Taxa_Means.txt", sep = "\t", row.names = FALSE, quote = FALSE) ``` *** Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples---a [box-and-whisker plot](#box-and-whisker-plot) as well as [separate bar plots](#separate-bar-plots). I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the **Class** level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code. Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an **Other** category. We can define 'Other' however we like so lets take a look at the overall relative abundance of each Class. Inspecting the table it looks like if we choose a **cutoff of `r min_top_perc`** (`r min_top_perc*100`%) we get 8 taxa---sounds pretty good. The rest go into the 'Other' category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the **Other** category and draw the graph. ### **Class abundance across sample type**: Here we generate a bar chart to look at taxonomic abundance for each of the main environment types. {data-commentary-width=300} ```{r define_other} # Here we conglomerate at 2%. Other <- means[means$mean < min_top_perc, ]$Class # or you can chose specifc taxa like this #Other_manual <- c("list", "taxa", "in", "this", "format") ``` ```{r metld_bar} meltd[meltd$Class %in% Other, ]$Class <- "Other" samp_names <- aggregate(meltd$Abundance, by = list(meltd$Sample), FUN = sum)[, 1] .e <- environment() meltd[, "Class"] <- factor(meltd[, "Class"], sort(unique(meltd[, "Class"]))) meltd <- meltd[order(meltd[, "Class"]), ] # Here we order Classes by the Phylum they belong to. #(meltd$Class) meltd$Class <- factor(meltd$Class, levels = c("Bacteroidia", "Alphaproteobacteria", "Gammaproteobacteria", "Oxyphotobacteria", "Deltaproteobacteria", "Clostridia", "Spirochaetia", "Mollicutes", "Other")) ``` ```{r plot_bar_fig2A} fig2A <- ggplot(meltd, aes_string(x = "Sample", y = "Abundance", fill = "Class"), environment = .e, ordered = TRUE, xlab = "x-axis label", ylab = "y-axis label") fig2A <- fig2A + geom_bar(stat = "identity", position = position_stack(reverse = TRUE), width = 0.95) + coord_flip() + theme(aspect.ratio = 1/2) fig2A <- fig2A + scale_fill_manual(values = friend_pal) fig2A <- fig2A + theme(axis.text.x = element_text(angle = 0, hjust = 0.45, vjust = 1)) fig2A <- fig2A + guides(fill = guide_legend(override.aes = list(colour = NULL), reverse = FALSE)) + theme(legend.key = element_rect(colour = "black")) fig2A <- fig2A + labs(x = "Environment", y = "Relative abundance (% total reads)") fig2A <- fig2A + theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_rect(colour = "black", fill = NA, size = 1)) #fig2A pdf("FIGURES/OUTPUT/Figure_X_taxa_by_environment.pdf") fig2A #invisible(dev.off()) fp <- ggplotly(fig2A, height = 300) fp %>% layout(margin = list(b = 100)) ``` *** At a 1.5% abundance cutoff, `r length(Other)` Classes are grouped into the 'Other' category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart: It took some tweaking to get the bar chart to look just right---so there is a lot of code here---and it could most certainly be better :/. While we're at it we will also save a copy of the figure so we can tweak it later and make it look pretty. One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the **Other** category. We will see what these taxa are in the next panel. ### **Class abundance across sample type**: Here we generate a modified bar chart separated by individual samples. {data-commentary-width=400} ```{r supp_fig1_calc, cache = TRUE} mdata_phy_all <- tax_glom(ps_slv_filt_AVG, taxrank = "Class", NArm = FALSE) # You can choose any taxonomic level here mdata_phyrel_all <- transform_sample_counts(mdata_phy_all, function(x) x/sum(x)) meltd_all <- psmelt(mdata_phyrel_all) meltd_all$Class <- as.character(meltd_all$Class) means_all <- ddply(meltd_all, ~Class, function(x) c(mean = mean(x$Abundance))) means_all$mean <- round(as.numeric(means_all$mean), digits = 5) # decending order taxa_means_all <- means_all[order(-means_all$mean), ] # ditch the sci notation taxa_means_all <- format(taxa_means_all, scientific = FALSE) top_perc_all <- top_n(taxa_means_all, n = 8, wt = mean) top_perc_all$mean <- round(as.numeric(top_perc_all$mean), digits = 5) min_top_perc_all <- round(as.numeric(min(top_perc_all$mean)), digits = 5) Other <- means_all[means_all$mean < min_top_perc_all, ]$Class # Here we conglomerate at 2%. meltd_all[meltd_all$Class %in% Other, ]$Class <- "Other" samp_names <- aggregate(meltd_all$Abundance, by = list(meltd_all$Sample), FUN = sum)[, 1] .e <- environment() meltd_all[, "Class"] <- factor(meltd_all[, "Class"], sort(unique(meltd_all[, "Class"]))) meltd_all <- meltd_all[order(meltd_all[, "Class"]), ] #levels(meltd_all$Class) meltd_all$Class <- factor(meltd_all$Class, levels = c( "Bacteroidia", "Alphaproteobacteria", "Gammaproteobacteria", "Oxyphotobacteria", "Clostridia", "Deltaproteobacteria", "Spirochaetia", "Mollicutes", "Other")) # Here we order Classes by the Phylum they belong to. ``` ```{r supp_fig1, include=FALSE, echo=FALSE, eval=FALSE} sup_fig1 <- qplot(data = meltd_all, x = ENV, y = Abundance, fill = Class, geom = "boxplot", ylab = "Relative Abundance") + theme(legend.position = "none") + facet_grid(Class ~ ., scales = "free_y", space = "free_y") + geom_jitter(width = 0.05) + geom_point(colour = "black", fill = "white") #+ guides(guide_legend(reverse = FALSE) ) sup_fig1 <- sup_fig1 + scale_fill_manual(values = friend_pal) + labs(x = "Environment", y = "Relative abundance (% total reads)") qp <- ggplotly(sup_fig1, height = 300) qp %>% layout(margin = list(b = 100)) #sup_fig1 pdf("FIGURES/OUTPUT/Figure_X_BAR.pdf") sup_fig1 #invisible(dev.off()) ``` ```{r separate_bars_stacked} options(scipen = 3) options(digits = 4) sup_fig2 <- ggplot(meltd_all, aes_string(x = "Sample", y = "Abundance", fill = "Class"), environment = .e, Ordered = TRUE) sup_fig2 <- sup_fig2 + geom_bar(stat = "identity", position = "stack") + facet_grid(Class ~ ENV, scales = "free", space = "free_y") sup_fig2 <- sup_fig2 + scale_fill_manual(values = friend_pal) # sup_fig2 <- sup_fig2 + theme(axis.text.x = element_text(angle = -90, # hjust = 0)) sup_fig2 <- sup_fig2 + theme(axis.text.x = element_blank()) sup_fig2 <- sup_fig2 + guides(fill = guide_legend(override.aes = list(colour = NULL), reverse = FALSE)) + theme(legend.key = element_rect(colour = "black"), legend.position = "none") + labs(x = "Individual samples", y = "Relative abundance (% total reads)") zp <- ggplotly(sup_fig2, height = 600) zp %>% layout(margin = list(b = 100)) #plotly::ggplotly(sup_fig2, tooltip = c('Sample', 'Abundance'), #originalData = TRUE, width = 600, height = 800) #sup_fig2 pdf("FIGURES/OUTPUT/Figure_X_separate_bar.pdf") sup_fig2 #invisible(dev.off()) ``` *** Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of *Other* taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including ```{r sed_calc} ps_slv_filt_AVG_sed <- prune_samples(c("SCC1", "SCC2", "SCC3", "SCR1", "SCR2", "SCR3"), ps_slv_filt_AVG) ps_slv_filt_AVG_sed <- prune_taxa(taxa_sums(ps_slv_filt_AVG_sed) > 0, ps_slv_filt_AVG_sed) mdata_phy_sed <- tax_glom(ps_slv_filt_AVG_sed, taxrank = "Phylum", NArm = FALSE) mdata_phyrel_sed <- transform_sample_counts(mdata_phy_sed, function(x) x/sum(x)) meltd_sed <- psmelt(mdata_phyrel_sed) meltd_sed$Class <- as.character(meltd_sed$Phylum) means_sed <- ddply(meltd_sed, ~Phylum, function(x) c(mean = mean(x$Abundance))) means_sed$mean <- round(as.numeric(means_sed$mean), digits = 5) mdata_phy_comp <- tax_glom(ps_slv_filt_AVG, taxrank = "Phylum", NArm = FALSE) mdata_phyrel_comp <- transform_sample_counts(mdata_phy_comp, function(x) x/sum(x)) meltd_comp <- psmelt(mdata_phyrel_comp) meltd_comp$Class <- as.character(meltd_comp$Phylum) means_comp <- ddply(meltd_comp, ~Phylum, function(x) c(mean = mean(x$Abundance))) means_comp$mean <- round(as.numeric(means_comp$mean), digits = 5) merge_sed_comp <- merge(means_sed, means_comp, by = "Phylum", all = TRUE) merge_sed_comp <- merge_sed_comp[order(-merge_sed_comp$mean.x), ] colnames(merge_sed_comp) <- c("Phylum", "mean Sed samples", "mean All samples") kable(merge_sed_comp, row.names = FALSE) %>% kable_styling(bootstrap_options = "striped", full_width = FALSE) %>% column_spec(1:3, width = "2.5cm") ``` ### **$\alpha$-diversity estimates** Here we use several diversity indices and add the results to the summary table. ```{r gen_summary_table, warning = FALSE} diversity <- estimate_richness(ps_slv_work_filt, measures=c( "Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher")) diversity_calc <- diversity %>% rownames_to_column("Sample_ID") # round values diversity_calc[c(3,5,10)] <- round(diversity_calc[c(3,5,10)], 1) diversity_calc[c(4,6,7,9)] <- round(diversity_calc[c(4,6,7,9)], 2) diversity_calc[8] <- round(diversity_calc[8], 3) #host_summary <- merge(host_details, diversity_calc) #host_summary$Observed <- NULL merge_tab2$total_ASVs <- NULL sample_table <- merge(merge_tab2, diversity_calc, by = "Sample_ID") datatable(sample_table, rownames = FALSE, width = "100%", colnames = c("Sample_ID", "Environment", "Site", "total reads", "No. ASVs", "Chao1", "Chao1 (SE)", "ACE", "ACE (SE)", "Shannon", "Simpson", "InvSimpson", "Fisher"), caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", "Table NA: ", htmltools::em("Alpha diversity estimates.")), extensions = "Buttons", options = list(columnDefs = list(list(className = "dt-center", targets = "_all")), dom = "Blfrtip", buttons = c("csv", "copy"), scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE, lengthMenu = c(10, 20, 30))) write.table(sample_table, "DATA_TABLES/OUTPUT/Table_X_alpha_diversity.txt", sep = "\t", row.names = FALSE, quote = FALSE) ``` *** Alpha diversity describes the diversity in a sample or site. There are several alpha diversity metrics available in phyloseq: `Observed`, `Chao1`, `ACE`, `Shannon`, `Simpson`, `InvSimpson`, `Fisher`. Play around to see how different metrics change or confirm these results. Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this [workshop tutorial](https://rpubs.com/maddieSC/R_SOP_UCR_Jan_2018){target="_blank"} by Kim Dill-McFarland and Madison Cox. First we run the diversity estimates, add these data to our summary table, and save a copy of this table. ### **$\alpha$-diversity statistics**: Test whether the results are normally distributed. Results of this will help guide the analyses we do next. {data-commentary-width=400} ```{r alpha_div_test_norm, warning = FALSE} # Convert to ps object sample_div <- sample_data(diversity) # Create new ps object with diversity estimates added to sample_data ps_slv_work_filt_div <- merge_phyloseq(ps_slv_work_filt, sample_div) # Run Shapiro test shapiro_test_Shan <- shapiro.test(sample_data(ps_slv_work_filt_div)$Shannon) shapiro_test_invSimp <- shapiro.test(sample_data(ps_slv_work_filt_div)$InvSimpson) shapiro_test_Chao1 <- shapiro.test(sample_data(ps_slv_work_filt_div)$Chao1) shapiro_test_Observed <- shapiro.test(sample_data(ps_slv_work_filt_div)$Observed) ``` >Shapiro-Wilk Normality Test for **Shannon** index. ```{r shap_Shan, echo = FALSE} shapiro_test_Shan ``` >Shapiro-Wilk Normality Test for **inverse Simpson** index. ```{r shap_invS, echo = FALSE} shapiro_test_invSimp ``` >Shapiro-Wilk Normality Test for **Chao1** richness estimator. ```{r shap_Choa1, echo = FALSE} shapiro_test_Chao1 ``` > Shapiro-Wilk Normality Test for **Observed ASV richness** estimator. ```{r shap_Observed, echo = FALSE} shapiro_test_Observed ``` *** Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, the Chao richness estimate, and Observed richness but this approach can be used for any metric. Ok, since the **p-values are significant** for the inverse Simpson, Chao richness, and Observed ASV richness we **reject the null hypothesis** that these data are normally distributed. However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index. ### **Assessing significance of $\alpha$-diversity metrics**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=500} >Normally distributed metrics ```{r normal} sampledataDF <- data.frame(sample_data(ps_slv_work_filt_div)) sampledataDF$ENV <- as.factor(sampledataDF$ENV) sampledataDF$SITE <- as.factor(sampledataDF$SITE) aov.shannon = aov(Shannon ~ ENV, data = sampledataDF) #Call for the summary of that ANOVA, which will include P-values summary(aov.shannon) ``` ```{r tukey} TukeyHSD(aov.shannon) ``` >Non-normally distributed metrics Kruskal-Wallis of **inverse Simpson** index. ```{r krusk_invsimp} #library(FSA) #dunnTest(InvSimpson ~ Sp, data = sampledataDF, method="bh") kruskal.test(InvSimpson ~ ENV, data = sampledataDF) ``` Pairwise significance test for **inverse Simpson** index. ```{r wilcox_invsimp} pairwise.wilcox.test(sampledataDF$InvSimpson, sampledataDF$ENV, p.adjust.method="fdr") ``` Kruskal-Wallis of **Chao1 richness** estimator. ```{r krusk_chao} #dunnTest(Chao1 ~ Sp, data = sampledataDF, method="bh") kruskal.test(Chao1 ~ ENV, data = sampledataDF) ``` Pairwise significance test for **Chao1 richness** estimator. ```{r wilcox_chao} pairwise.wilcox.test(sampledataDF$Chao1, sampledataDF$ENV, p.adjust.method="fdr") ``` Kruskal-Wallis of **Observed ASV richness** index. ```{r krusk_observed} #library(FSA) #dunnTest(Observed ~ Sp, data = sampledataDF, method="bh") kruskal.test(Observed ~ ENV, data = sampledataDF) ``` Pairwise significance test for **Observed ASV richness** index. ```{r wilcox_observed, warning = FALSE} pairwise.wilcox.test(sampledataDF$Observed, sampledataDF$ENV, p.adjust.method="fdr") ``` *** Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test). Ok, the results of the ANOVA are significant. Here we use the Tukey's HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different. Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting. Now we can look at the results on the inverse Simpson diversity and Chao's richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance. For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis. ### **$\alpha$-diversity plots**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=300} >Alpha diversity of microbial communities ```{r alpha_div_fig_2B, warning = FALSE} samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9") fig2B <- plot_richness(ps_slv_work_filt, x = "ENV", measures = c("Observed", "Shannon", "InvSimpson", "Chao1"), color = "ENV", nrow = 1) fig2B <- fig2B + geom_boxplot() + geom_jitter(width = 0.05) fig2B <- fig2B + scale_colour_manual(values = samp_pal) + labs(x = "Environment", y = "Diversity") #fig2B + geom_boxplot(aes(colour = black)) #fig2B <- fig2B + theme_bw() + geom_point(size = 2.5, aes(color = ENV)) + ap <- ggplotly(fig2B, height = 400) ap %>% layout(margin = list(b = 100)) #fig2B pdf("FIGURES/OUTPUT/Figure_X_richness.pdf") fig2B #invisible(invisible(dev.off())) ``` *** Here we plot the results of each diversity index and include the *Observed* ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments. ### **$\beta$-diversity analysis**: NMDS coupled with Jensen-Shannon divergence. {data-commentary-width=400} ```{r run_nmds, results = "hide"} set.seed(3131) ord.nmds.jsd_slv <- ordinate(ps_slv_work_filt, method = "NMDS", distance = "bray") ``` ```{r ord_results} ord.nmds.jsd_slv ``` *** Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination option using the `method` flag and `distance` metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen-Shannon divergence. We also save a copy of the figure for later tweaking. We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot. ### **$\beta$-diversity NMDS plot **: {data-commentary-width=400} ```{r beta_div_fig_2C} fig2C <- plot_ordination(ps_slv_work_filt, ord.nmds.jsd_slv, color = "ENV", shape = "SITE") fig2C <- fig2C + geom_point(size = 4.6, stroke = 0.075) + scale_colour_manual(values = samp_pal) #fig2C <- fig2C + scale_colour_manual(values = samp_pal) fig2C <- fig2C + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), panel.grid.major = element_line("grey"), panel.grid.minor = element_line("grey"), panel.border = element_rect(colour = "black", fill = NA, size = 1)) + theme(legend.key = element_rect(colour = "black")) #fig2C <- fig2C + coord_fixed() wp <- ggplotly(fig2C, height = 600) wp %>% layout(margin = list(b = 100)) #fig2C <- fig2C + stat_ellipse(type = "t") + theme_bw() #plotly::ggplotly(fig2C, tooltip = 'ENV', originalData = TRUE) #fig2C pdf("FIGURES/OUTPUT/Figure_X_NMDS.pdf") fig2C invisible(dev.off()) ``` *** So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment? ### **$\beta$-diversity statistics **: {data-commentary-width=400} First we use the `adonis` function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not. ```{r ordination_stats_adonis} set.seed(1911) fish.jsd <- phyloseq::distance(ps_slv_work_filt, method = "jsd") sampledf <- data.frame(sample_data(ps_slv_work_filt)) fish_adonis <- adonis(fish.jsd ~ ENV, data = sampledf, permutations = 1000) fish_adonis ``` These results indicate that centroids are significantly different across environments meaning that communities are different by environments. We can also use the `pairwiseAdonis` package for pair-wise PERMANOVA analysis. ```{r pairwise_adonis} pairwise.adonis(fish.jsd, factors = sampledf$ENV, p.adjust.m = "bonferroni") ``` Here we see again we see that communities are different by environments. However, PERMANOVA assumes equal beta dispersion so we will use the `betadisper` function from the `vegan` package to calculate beta dispersion values. ```{r betadisper} beta_adonis <- betadisper(fish.jsd, sampledf$ENV, bias.adjust = TRUE) beta_adonis ``` And then a pair-wise Permutation test for homogeneity of multivariate dispersions using `permutest` (again from the `vegan` package). ```{r permutest} permutest(beta_adonis, pairwise=TRUE, permutations=1000) ``` These results are significant, meaning that environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the significant differences (p-value < 0.05) are between most environments. This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions. We can also use Analysis of Similarity (ANOSIM)---which does not assume equal group variances---to test whether overall microbial communities differ by environment. ```{r ordination_stats_anosim} spgroup <- get_variable(ps_slv_work_filt, "ENV") fish_anosim <- anosim(distance(ps_slv_work_filt, "jsd"), grouping = spgroup) summary(fish_anosim) ``` And the AN0SIM result is significant meaning that environment influences microbial community composition. ```{r simper, eval = FALSE, echo = FALSE, include = FALSE} source("/Users/j/Desktop/FISH_MICROBIOME/02_PHYLOSEQ_WORKFLOW/HELPER_SCRIPTS/simper_pretty.R") #Using the function otutab <- as.table(otu_table(ps_slv_work_filt)) stuff <- simper.pretty(otu_table(ps_slv_work_filt), sample_data(ps_slv_work_filt), "Sp", perc_cutoff=0.5, low_cutoff = 'y', low_val=0.01, 'name') ``` *** To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below. Water {.storyboard} ========================================= ### **Site differences**: Here we look at differences between the two sites. For {data-commentary-width=400} ```{r, echo=FALSE} # Working directory # Load SV table water_seq_table <- read.csv("DATA_TABLES/INPUT/water_seq_table.csv", stringsAsFactors = FALSE, row.names = 1) # SV table water_seq_table <- as.data.frame(t(water_seq_table)) # Delete columns when sum == 0 water_seq_table <- water_seq_table[, which(colSums(water_seq_table) != 0)] # Add First column with name of each group water_seq_table <- cbind(DO = c("normoxic","normoxic","normoxic","normoxic","normoxic" ,"hypoxic","hypoxic","hypoxic"), water_seq_table) library(labdsv) iva <- indval(water_seq_table[,-1], water_seq_table[,1]) gr <- iva$maxcls[iva$pval<=0.05] iv <- iva$indcls[iva$pval<=0.05] pv <- iva$pval[iva$pval<=0.05] fr <- apply(water_seq_table[,-1]>0, 2, sum)[iva$pval<=0.05] indvalsummary <- data.frame(group=gr, indval=iv, pvalue=pv, freq=fr) indvalsummary <- indvalsummary[order(indvalsummary$group, -indvalsummary$indval),] write.csv(indvalsummary, "DATA_TABLES/OUTPUT/indvalsummary.csv") # Load Tax table water_tax_table <- read.csv("DATA_TABLES/INPUT/water_tax_table.csv", stringsAsFactors = FALSE, row.names = 1) # tax table # Merge indvalsummary with tax names indvalsummary_tax <- merge(indvalsummary, water_tax_table, by="row.names", all=TRUE) # Drop rows with NAs indvalsummary_tax <- indvalsummary_tax[!(is.na(indvalsummary_tax$group)),] # Rename leves of group factor # indvalsummary_tax <- as.character(indvalsummary_tax$group) ``` ```{r, include = FALSE} lapply(indvalsummary_tax, class) class(indvalsummary_tax$group) = "character" indvalsummary_tax$group <- ifelse(indvalsummary_tax$group == "1", as.character("hypoxic"), as.character("normoxic")) ``` ```{r} datatable(indvalsummary_tax, rownames = FALSE, width = "100%", caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", "Table NA: ", htmltools::em("DESCRIPTION.")), extensions = "Buttons", options = list(columnDefs = list(list(className = "dt-center", targets = "_all")), dom = "Blfrtip", buttons = c("csv", "copy"), scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE, lengthMenu = c(10, 50, 160))) write.csv(indvalsummary_tax, "DATA_TABLES/OUTPUT/Hypoxia_indvalsummaryWATER_tax.csv", row.names = F) ``` *** ```{r} asv_list <- as.vector(indvalsummary_tax$Row.names) # Create ps objects da_asvs <- prune_taxa(asv_list, ps_slv_water) da_asvs_full <- prune_taxa(asv_list, ps_slv_work_filt) # Create fasta file from tax_table table2format <- tax_table(da_asvs) #retain only the column with the sequences table2format_trim <- table2format[, 7] table2format_trim_df <- data.frame(row.names(table2format_trim), table2format_trim) colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ") table2format_trim_df$ASV_ID <- sub("ASV" , ">ASV", table2format_trim_df$ASV_ID) #format fasta write.table(table2format_trim_df, "DATA_TABLES/OUTPUT/da_asv.fasta", sep = "\r", col.names = FALSE, row.names = FALSE, quote = FALSE, fileEncoding = "UTF-8") ```